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In Lattice QCD computations a substantial amount of work is spent in solving the Dirac equation. 
In the recent past it has been observed that conventional Krylov solvers tend to critically slow 
down for large lattices and small quark masses. We present a Schwarz alternating procedure 
(SAP) multilevel method as a solver for the Clover improved Wilson discretization of the Dirac 
equation. This approach combines two components (SAP and algebraic multigrid) that have 
separately been used in lattice QCD before. In combination with a bootstrap setup procedure 
we show that considerable speed-up over conventional Krylov subspace methods for realistic 
configurations can be achieved. 
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1. Introduction 

The most costly task in lattice QCD computations is the solution of large sparse linear systems 
of equations 

Dz = b, (1.1) 

where D is a discretization of the Dirac operator. Here we consider the Wilson discretization 
D = D{U) + m which couples only nearest neighbors and depends on a gauge field U and a mass 
parameter m. Usually z is calculated by a Krylov subspace method (e.g. CGN, GCR, BiCGStab). 
Those methods suffer from critical slowing down when approaching the critical mass as well as 
lattice spacing a = 0. Thus it is of utmost importance to develop preconditioners for these methods 
that remedy these scaling problems. 

In the recent past preconditioners based on domain decomposition (DD) for the solution 
of (1.1) have been proposed in [1]. Although DD methods excel in supercomputing environments 
due to their high inherent parallelism they are unable to remedy the scaling problems completely 
unless they are combined with a multilevel approach. Thus we combine the DD approach with an 
algebraic multigrid hierarchy based on a bootstrap aggregation framework [2, 3]. Our approach 
is similar in construction to the one introduced in [4, 5, 6] where it has been shown that using 
such algebraic multigrid approaches can remedy the scaling problems in QCD computations. As 
in [4, 5, 6] we obtain a multilevel hierarchy using non-smoothed aggregation. The difference is 
that we replace the multigrid smoother by a DD approach, expecting a gain in efficiency on highly 
parallel machines. 

In section 2 we introduce the concept of DD methods before we explain the construction of 
our algebraic multilevel method in some detail in section 3. Thereafter we give numerical results 
of our method in section 4 and finish with some concluding remarks. 

2. Domain Decomposition 

Domain decomposition methods were developed as iterative solvers for linear systems arising 
from discretizations of PDEs. The main idea consists of solving the system on the whole domain 
by repeatedly solving smaller systems with less degrees of freedom on smaller subdomains. 

Consider a block decomposition {Jt?j : i = 1 , . . . , k} of a lattice Jz? (Figure 1 illustrates a 2D 
example). The corresponding trivial embeddings and block solvers are denoted by I<g { : Jz^ — > «£? 
and Bi = /^.[/^D/jf.] -1 /^,,. Note that the trivial embedding 1%, is just the restriction of the identity 
on Jz? to Jz^-. Then one iteration of a domain decomposition method consists of solving each of 
the block systems e <— B[r, interleaved with a number of residual updates r = b — Dz. In the two 
extreme cases where one does only one residual update before solving all block systems or when 
the residual is updated after each block solution, the corresponding error propagators are given by 

k k 
1-(J»A and H(l -BA). (2.1) 

i=l i=l 

These methods go back to H. Schwarz [7] and thus are called additive Schwarz and multiplicative 
Schwarz method, respectively. The block systems of the additive variant can be solved simultane- 
ously while the multiplicative valiant is inherently sequential. Its advantage is that it spreads the 
information faster on the lattice as a solution of a block system uses previous solutions of other 
block systems. 
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Algorithm 2.1 Red-Black Schwarz 

1: for c = 1 to 2 do 

2: r «— b — Dz 

3: for all i 6 {1, ...,k} with color (i) = c do 

4: Z^Z + Bjr 

5: end for 

6: end for 



The two methods can be combined to exploit advantages of both methods. Coloring the blocks 
such that no adjacent blocks have the same color, a residual update on one block no longer influ- 
ences the residual on blocks of the same color. Thus it suffices to perform the update once for each 
color. All blocks of the same color can then be computed simultaneously as described in Algo- 
rithm 2.1. Such a DD approach has been applied to solve (1.1) in [1], where it has been named 
Schwarz Alternating Procedure (SAP). 

Typically the solution of the block system e = Bjr is approximated by a few iterations of a 
Krylov subspace method (e.g. GMRES), and the DD method itself is in turn used as a precondi- 
tioner for a (flexible) Krylov subspace method. As illustrated in Figure 2 we observe that SAP is 
able to reduce error components belonging to a large part of the spectrum very well but a small part 
belonging to eigenvectors (EVs) to small eigenvalues (EWs) remains intractable. For larger config- 
urations the number of EWs with small magnitude of the Dirac operator gets larger, which yields 
an explanation why SAP is not able to remedy the scaling problem as the number of intractable 
eigenvectors increases as well. Though, the seen behavior of damping large EVs is desirable for an 
iterative method to be used as a smoother in a multigrid method and motivated us to use it in this 
context. 

3. Algebraic Multigrid 

A multigrid method typically consists of a simple iterative method called smoother and com- 
plementary coarse grid correction. As motivated in section 2 we deem SAP suitable for the use 
as a smoother since it is cheap to compute and reduces the error efficiently on a large part of the 
spectrum. The main idea of multigrid is to treat the error that is left after a few iterations of the 




Figure 1: block decomposed lattice (reduced to Figure 2: error component reduction in terms of 

2D) with 2 colors EVs on a 4 4 lattice with 2 4 blocks, EWs sorted 

by magnitude 
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smoother within a smaller subspace where the troublesome error components can be approximated. 
More precisely we want to define and solve a "coarse" linear system 



with a much smaller operator D c : C" c — > C" c in order to reduce the error on the critical part of ( 1 . 1 ). 
To this purpose we have to define linear maps R : C" — > C" c to restrict information based on the 
current residual r = b — Dz to the subspace and a linear map P : C" c — > C" to interpolate the infor- 
mation that we obtain from solving (3.1) back to C" where (1.1) is given. This yields a subspace 
correction 



with the corresponding error propagator 1 — PD~ RD. As D c should resemble the action of D on 
the troublesome subspace approximated by span{P), the action of D c is chosen as the action of D 
on interpolated vectors which are restricted afterwards. Formally this amounts to a Petrov-Galerkin 
formulation of the coarse operator as D c = RDP. With this choice of D c the error propagator of 
the subspace correction is given by 1 — P(RDP)~ l RD. In order to benefit from such a subspace 
correction, solving (3.1) has to be much cheaper than solving (1.1). That is n c should be small 
compared to n, and D c should be sparse. As the dimension of the troublesome subspace grows with 
n (cf. [8]) we do not want to fix n c (like in deflation methods) but want to find a sparse description 
of D c on that subspace. 

Once D c is found a basic two level algorithm consists of the alternating application of smoother 
and subspace collection. This procedure can be recursively extended by formulating a two level 
algorithm of this kind for the computation of (3.1) until we get an operator which is small enough 
to solve (3.1) directly. 

Aggregation Based Interpolation: We decided to adjust an aggregation based interpolation, 
that in turn yields the subspace correction, as the complementary component to the SAP smoother. 
Due to the fact that the coarse grid correction in (3.2) only acts on error components in range(P), it 
should approximate the subspace spanned by eigenvectors to eigenvalues of small magnitude of D 
(cf. Figure 2). In the multigrid literature, P is built from right, and R from left EVs corresponding 
to EWs of small magnitude of D. Due to the spectral properties of the Wilson Dirac operator it 
is natural to choose R = y^P. Furthermore, with additional assumptions on its structure we can 
choose R = P^ according to [5]. Therefore we define the aggregates in such a way that there exists 
Yi, such that 75P = Py^. Note, that with these assumptions on P the error propagator of the subspace 
correction (3.2) is given by z ^— z + P(P^DP)~ Y P^r. 

With this structure of P in mind, we define its entries based on a set of test vectors {vi, . . . , v^} 
whose span approximates the troublesome subspace and a set of aggregates {srf\, . . . ,=c^}.The ag- 
gregates can be realized as another block decomposition of the lattice. Note that the DD smoother 
and the interpolation do not have to share a common block decomposition. The interpolation P is 
then given by decomposing the test vectors over the aggregates (cf. Figure 3). Hence P is a linear 
map from the coarse grid to the fine grid, defined by 



D c Zc = b c 



(3.1) 



z^z + PD- [ Rr 



(3.2) 



Pe 



'j'- =I sfj v (0'-l)modJV)+l for j = l,---,N-S 
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Figure 3: construction of the interpolation (operator based point of view) 



where ej is the j-th unit vector. In order to have P f P = I, the test vectors are locally orthonormalized 
over the aggregates. Note that due to the aggregation structure of P and R the sparsity/connection 
structure of D c resembles the one of D, i.e., the corresponding graphs of the operators are regular 
four dimensional grids. Thus we can apply (3.2) recursively to (3.1) and obtain a hierarchy of 
coarser grids and coarser operators. This construction of P is similar to constructions found in 
[4, 5, 6, 8]. More precisely, the structure of the interpolation operators is identical but the test 
vectors v,- used to build them and thus the actions of the operators are different. 

Bootstrap Setup: A critical part of the construction of an efficient multigrid hierarchy is the 
computation of the test vectors used in the definition of P. The setup procedure we employ for this 
task is divided into two parts. We start with a set of random test vectors vi , . . . , vjy and apply a small 
number of smoother iterations S y to them. During this procedure the smoothed test vectors 5' v v ; are 
kept orthonormal. After that a temporary operator on the next coarser grid is computed according 
to the aggregation based construction. This procedure is continued recursively on the coarser grids 
until we get an operator on the coarsest grid. Herein we restrict smoothed test vectors with to 
the next coarser grid. In what follows we omit the level indices for the sake of simplicity. 

Since the subspace range(P) should contain a significant part of the eigenvectors to eigenval- 
ues of small magnitude of the fine grid space C", the EWs of small magnitude of D c = P^DP should 
approximate those of D. For the corresponding EVs the relations P^P = 1 and P^ (DP(p — XP(p) = 
imply DPcp « XP(p. Thus the second part of our setup procedure starts with the calculation of N 
approximations to the smallest EVs and EWs {(A,-, <p;)} of P^DP by means of harmonic Ritz vec- 
tors and values. The approximate EVs % for i = 1,. . . ,N are successively interpolated to the next 
finer grid and smoothed towards their harmonic Ritz values with some steps of SAP with iteration 
matrix In this case the local inverses for the pair (A,-, <p,) are given by 



The resulting set of vectors and the old test vectors V := {S(Xj) l P(pi : i = 1, . . . ,N} U {S v Vj : j = 
1, . . . ,Af} are again reduced to vectors. In order to preserve the most significant information, the 
,/V smallest singular values and their corresponding vectors of the 2N x 2N matrix (DV) t DV are 
calculated and the corresponding ,/V linear combinations of the vectors of V are the final vectors V. 
They define the interpolation from the next coarser grid to the current grid and the operator on the 
next coarser grid. The second part of the setup executes this procedure exactly once, starting on the 
coarsest grid. 



B J -(A ; -)=/^[4 7 (D-A ! -)/^r 1 4 ; . 
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Figures 4 and 5 illustrate the influence of the second part of the setup procedure on the coarse 
grid operator in a two level hierarchy. The smallest eigenvalues of the fine grid operator are much 
better represented on the coarse grid. For other setup procedures, see e.g. [4, 2, 3]. 
4. Results 

Our adaptive DD multilevel solver (aMG-DD) has been implemented in C using the paral- 
lelization interface of MP I . Krylov subspace methods have been implemented within a common 
framework for a fair comparison. For the numerical experiments we have combined our multigrid 
solver against CGN, i.e. CG on the normal equation. All results were produced with a two level 
method. The stopping criterion was to reduce the initial residual norm by a factor of at least 10 10 . 
The solutions of the block systems within SAP were approximated by 3 iterations of GMRES and 
the coarse grid equation (3.1) was approximately solved with 12 iterations of GMRES. For the 
outer flexible GMRES routine we chose a restart length of 25. All results have been computed on 
Juropa at the Jiilich Supercomputing Centre. 







FGMRES 








FGMRES 








+ aMG-DD 


CGN 






+ aMG-DD 


CGN 


setup 


timings 


45.22s 


setup 


timings 


24.85s 




solve 


iterations 


9 


4476 


solve 


iterations 


17 


9181 




timings 


6.97s 


121.82s 




timings 


5.05s 


119.04s 


total 


timings 


52.19s 


121.82s 


total 


timings 


29.90s 


119.04s 



Table 1: 48 x 24 3 Clover Wilson-Dirac, j3 = 5.3 
= 2.56 GeV), K = 0.13590 (m n = 630 MeV), 
c sw = 1.90952, generated using public code with pa- 
rameters from L. Del Debbio [9], block-size 3 4 , coars- 
ening 3 4 x 6 20, 512 cores (Juropa at JSC). 



Table 2: 64 x 32 3 Wilson-Dirac 2HEX smeared 
tree level improved Clover, j3 = 3.5 (1 /a — 
2.130 GeV), K = 0.12646 (m« = 300 MeV), 
c sw = 1.0, provided by the BMW collabora- 
tion [10, 11], block-size 2 4 , coarsening 4 4 x 6 — >• 
20, 4096 cores (Juropa at JSC). 

In the cases shown our multigrid solver was 17 and 24 times faster than CGN. Including the 
setup time we are still twice and four times as fast as CGN. Since the setup has to be done only 
once the benefits of our approach are larger the more right hand sides there are to be solved. 
5. Summary and Outlook 

The developed method combining DD techniques and algebraic multigrid shows great poten- 
tial to speed-up calculations of propagators in lattice QCD. Even for single right hand sides our 
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method outperforms conventional Krylov subspace methods with the potential of an even more 
significant speed-up when solving for many right hand sides. This result is mainly due to the in- 
troduction of the highly parallel DD smoother and the bootstrap setup into the algebraic multigrid 
method. While the first speeds up the setup of the method and the subsequent solution the latter 
significantly speeds up the setup of the multigrid hierarchy, which in general is a bottleneck in 
algebraic multigrid methods especially compared to setup-free Krylov subspace methods. We are 
currently working on an optimized version of the code that should be able to run on large-scale 
parallel machines and extend our testing of the method towards larger lattices and lighter quark 
masses. In the near future we plan to incorporate our algorithm into the production codes of our 
collaborators within SFB TR55. 

Acknowledgments: This work is funded by Deutsche Forschungsgemeinschaft (DFG) Tran- 
sregional Collaborative Research Centre 55 (SFB TR55). 
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